BMC Medical Research Methodology — Latest Matching Preprints

1

Simulation-Based Comparison of Controlled Interrupted Time Series (CITS) and Multivariable Regression

Orwa, F. O.; Mutai, C.; Nizeyimana, I.; Mwangi, A.

2026-04-13 health policy 10.64898/2026.04.10.26350670 medRxiv

Top 0.1%

22.7%

When randomized controlled trials are impractical, interrupted time series designs offer a rigorous quasi-experimental approach to assess population level policies. Indeed, in the context of quasi-experimental designs (QEDs), the Interrupted Time Series (ITS) method is commonly thought of as the most robust. But interrupted time series designs are susceptible to serial correlation and confounding by time-varying factors associated with both the intervention and the outcome, which may result in biased inference. Thus, we provide a simulation-based contrast of controlled interrupted time series (CITS) and multivariable regression (multivariable negative binomial regression) for estimation of policy effects in count time series data. These approaches are widely used in policy evaluations, yet their comparative performance in typical population health settings has rarely been examined directly. We tested both approaches within a variety of data generating situations, differing in the series length, intervention effect size, and magnitude of lag-1 autocorrelation. Bias, standard error calibration, confidence interval coverage, mean squared error, and statistical power were assessed for performance. Both methods gave unbiased estimates for moderate and large intervention effects, although bias was more pronounced for small effects, particularly in short series. Although the point estimate performance was similar, inferential properties varied significantly. CITS always had smaller mean squared error, better consistency between model based and empirical standard errors, and confidence interval coverage near the 95% nominal levels over weak to moderate autocorrelation. By contrast, multivariable regression was more sensitive to serial dependence, leading to underestimated standard errors and undercoverage, especially at moderate to high autocorrelation, regardless of Newey-West adjustments. 
These findings show the benefits of using a concurrent control series and the importance of structurally accounting for serial correlation when studying population level policies with time series data.
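The point about underestimated standard errors under serial dependence can be illustrated with a deliberately simplified sketch. It is not the authors' simulation: it uses a Gaussian AR(1) series rather than count data, estimates only a mean rather than a policy effect, and the series length, autocorrelation of 0.6, number of replications, and seed are all assumed values chosen for illustration.

```python
import random
import statistics

def simulate_ar1(n, phi, rng):
    """Generate an AR(1) series y_t = phi * y_{t-1} + e_t with unit-variance noise."""
    y, prev = [], 0.0
    for _ in range(n):
        prev = phi * prev + rng.gauss(0.0, 1.0)
        y.append(prev)
    return y

def se_comparison(n=100, phi=0.6, n_sims=1000, seed=1):
    """Compare the empirical SE of the sample mean with the naive i.i.d. SE."""
    rng = random.Random(seed)
    means, naive_ses = [], []
    for _ in range(n_sims):
        y = simulate_ar1(n, phi, rng)
        means.append(statistics.fmean(y))
        # Naive SE ignores autocorrelation: sd / sqrt(n)
        naive_ses.append(statistics.stdev(y) / n ** 0.5)
    empirical_se = statistics.stdev(means)       # spread of the estimates across replications
    mean_naive_se = statistics.fmean(naive_ses)  # what an i.i.d. model would report on average
    return empirical_se, mean_naive_se

empirical_se, mean_naive_se = se_comparison()
print(f"empirical SE of the mean: {empirical_se:.3f}")
print(f"average naive SE:         {mean_naive_se:.3f}")
```

With positive autocorrelation the naive standard error comes out well below the empirical one, which is the mechanism behind the undercoverage the abstract reports for models that do not account structurally for serial correlation.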

2

Protocol for LLM-Generated CONSORT Report for Increased Reporting: A Parallel-Arm Randomized Controlled Trial (Protocol)

Krauska, A. N.; Rohe, K.

2026-04-17 health policy 10.64898/2026.04.15.26350926 medRxiv

Top 0.1%

6.2%

Background: Randomized controlled trials (RCTs) often have incomplete methods reporting despite widespread adoption of the CONSORT guideline. The editorial process is supposed to detect these shortcomings and request clarifications from authors, which is time-consuming. We developed an LLM-based CONSORT Rohe Nordberg Report that highlights which CONSORT items appear fully or partially reported, checks page references claimed by authors, and then creates follow-up questions so that authors can more easily correct missing information. Methods: This parallel-arm, superiority RCT will randomize eligible RCT submissions (after desk screening) 1:1 into intervention (editorial team and authors receive the Rohe Nordberg Report) or control (standard editorial review only). The primary outcome is whether manuscripts improve their reporting of CONSORT items in the Methods and Results sections between the original submission and first revision. This will be assessed by blinded human reviewers who evaluate the textual changes for improvements between the original and revised manuscripts for each relevant CONSORT item. Secondary outcomes include time to editorial decisions, rejection and non-resubmission rates, whether authors can correctly identify where CONSORT items are reported, and extent of revisions. Human evaluators will be blinded to whether the manuscript was in the intervention or control group. Discussion: By providing authors and the editorial team with specific follow-up questions for each underreported CONSORT item, we hypothesize that basic underreporting will be more efficiently detected and corrected. Using blinded human reviewers as the primary outcome assessors ensures a rigorous, unbiased evaluation. If successful, this approach may help align manuscripts more closely with CONSORT standards, ultimately benefiting evidence synthesis.

3

Causal estimands and target trials for the effect of lag time to treatment of cancer patients

Goncalves, B. P.; Franco, E. L.

2026-04-08 epidemiology 10.64898/2026.04.07.26350338 medRxiv

Top 0.2%

4.8%

Timeliness of therapy initiation is a fundamental determinant of outcomes for many medical conditions, most importantly, cancer. Yet, existing inefficiencies in healthcare systems mean that delays between diagnosis and treatment frequently adversely affect the clinical outcome for cancer patients. Although estimates of effects of lag time to therapy would be informative to policymakers considering resource allocation to minimize delays in oncology, causal methods are seldom explicitly discussed in epidemiologic analyses of these lag times. Here, we propose causal estimands for such studies, and outline the protocol of a target trial that could be emulated with observational data on lag times. To illustrate the application of this approach, we simulate studies of lag time to treatment under two scenarios: one in which indication bias (Waiting Time Paradox) is present and another in which it is absent. Although our discussion focuses on oncologic outcomes, components of the proposed target trial could be adapted to study delays for other medical conditions. We believe that the clarity with which causal questions are posed under the target trial emulation framework would lead to improved quantification of the effects of lag times in oncology, and hence to better informed policy decisions.

4

Assessing Compliance with Reporting Requirements in European Phase II to IV Clinical Trials: A Cross-Sectional Observational Study

Bruckner, T.; Dike, C. E.; Caquelin, L.; Freeman, A.; Aspromonti, D. A.; DeVito, N.; Song, Z.; Karam, G.; Nilsonne, G.

2026-04-05 health policy 10.64898/2026.04.03.26350111 medRxiv

Top 0.2%

4.4%

Objectives: To assess the availability of key clinical trial registration data and compliance with legal reporting requirements for all Phase 2-4 drug trials registered on the new European Clinical Trial Information System (CTIS) registry. This study is the first ever assessment of data quality and legal compliance with reporting requirements on CTIS. Design: Cross-sectional observational study of CTIS registry data combined with manual review of results documents. Setting: Cohort of all 7,547 Phase II-IV clinical trials registered on CTIS as of November 2025. Main outcome measures: Number and proportion of missing data points in CTIS registration data. Proportion of completed clinical trials that are compliant with regulatory reporting requirements. Results: Trial registration data quality was high overall with more than 99% of expected data present. Of 234 clinical trials legally required to report results, fewer than half (49.6%) fully reported results within the required timeframe, 20 trials (8.5%) fully reported results late, and 98 trials (41.9%) failed to fully report results. Legal compliance was similar for adult trials (79/158) and paediatric trials (37/76). Conclusions: Sponsor compliance with legal reporting requirements is weak. Current efforts by European regulators to monitor and enforce compliance appear to be insufficient. New results reporting functions currently being set up by trial registries worldwide will require quality assurance processes. Trial registration: Study protocol prospectively registered on OSF: https://osf.io/sn4j2/overview
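The reported compliance figures can be cross-checked with a few lines of arithmetic. The sketch below takes its counts from the abstract, except for the on-time count, which the abstract does not state directly: it is assumed here to be the sum of the adult and paediatric numerators (79 + 37 = 116).

```python
# Counts from the abstract; on_time is an inferred value (see assumption below).
total_due = 234
on_time = 79 + 37      # assumption: subgroup numerators count on-time reporters
late = 20
not_reported = 98

def pct(count, denom):
    """Percentage rounded to one decimal place."""
    return round(100 * count / denom, 1)

shares = {
    "on_time": pct(on_time, total_due),
    "late": pct(late, total_due),
    "not_reported": pct(not_reported, total_due),
}
subgroups = {
    "adult": pct(79, 158),
    "paediatric": pct(37, 76),
}
print(shares)
print(subgroups)
```

Under that assumption the three categories sum to 234 and the percentages match those reported (49.6%, 8.5%, 41.9%).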

5

Methodological Considerations in Sibling Analyses of Prenatal Acetaminophen

Ahlqvist, V. H.; Sjoqvist, H.; Gardner, R. M.; Lee, B. K.

2026-03-30 epidemiology 10.64898/2026.03.27.26349515 medRxiv

Top 0.2%

4.4%

Background: Sibling-matched designs control for shared familial confounding but remain vulnerable to non-shared confounders. Bi-directional sensitivity analyses, which stratify families by whether the older or younger sibling was exposed, are commonly used to assess carryover effects. We aimed to demonstrate how this methodological approach can introduce severe confounding by parity. Methods: We conducted simulations motivated by a recent epidemiological study. The true causal effect of a hypothetical exposure (prenatal acetaminophen) on neurodevelopmental outcomes was set to strictly null. To introduce parity-related confounding, baseline exposure and outcome probabilities were varied slightly by birth order. We compared conditional logistic regression effect estimates from total sibling models against bi-directional stratified models. Results: In the total simulated sibling cohort, models yielded the true null effect (odds ratio = 1.00) when adjusting for parity. However, the bi-directional analyses exhibited divergent artifactual signals. Because parity is perfectly collinear with exposure in these stratified subsets, it cannot be adjusted for. For example, when the older sibling was exposed, the odds ratio for autism spectrum disorder was 1.68; when the younger was exposed, the odds ratio was 0.60. Conclusions: Divergent estimates in bi-directional sibling analyses can be a predictable artifact of parity confounding rather than evidence of carryover effects or invalidating unmeasured bias. Overall sibling models adjusting for parity may remain robust despite divergent stratified sensitivity results.

6

Data sharing policies, requirements, and support from public and private clinical trial sponsors: a survey on top sponsors of clinical trials in Europe

Tai, K. H.; Varvara, G.; Escoffier, E.; Mansmann, U.; DeVito, N. J.; Vieira Armond, A. C.; Naudet, F.

2026-04-01 health informatics 10.64898/2026.03.31.26349853 medRxiv

Top 0.2%

4.3%

Objective: To map the presence, public availability, and content of clinical trial data sharing policies (DSP), data management and sharing plans (DMSP), and data use agreements (DUA) among the most prolific public and private clinical trial sponsors operating in the European Union, and to identify key areas of convergence, divergence, and constraint in the context of the General Data Protection Regulation (GDPR). Eligibility criteria: We included organisation-level documents describing approaches to clinical trial data sharing or data management from the top 20 public and top 20 private sponsors ranked by the number of trials registered in the EU Clinical Trials Information System (CTIS). Eligible materials comprised publicly available or sponsor-shared policies, guidelines, statements, templates, and agreements relevant to clinical trial data sharing or management. Sources of evidence: Evidence was identified through systematic searches of sponsors' public websites, structured Google searches, and major data management plan platforms (DMPTool, DMPonline, DMP Assistant), complemented by direct contact with sponsors to verify findings and request missing documentation. All sources were archived and catalogued. Charting methods: Two reviewers independently extracted data using a structured form, capturing the existence, accessibility, and content of data sharing policies, data management and sharing plans, and data use agreements. Quantitative data were summarised descriptively, and a non-interpretive descriptive content analysis was conducted to characterise recurring policy elements and areas of heterogeneity. Results: Among 40 sponsors, private sponsors were substantially more likely than public sponsors to make trial-specific data sharing policies and data use agreements publicly accessible, often via established data sharing platforms. Public sponsors more frequently referenced data management and sharing plans, but these were heterogeneous in scope and often embedded within broader institutional governance documents rather than tailored to clinical trials. Across sectors, GDPR compliance, data protection, and legal safeguards were emphasised, while operational aspects such as dataset readiness, review criteria, and downstream responsibilities varied widely. Overall response rate to sponsor verification was 37.5%. Conclusion: Clinical trial data sharing governance in the EU shows a marked sectoral imbalance among the top sponsors. Private sponsors tend to provide more detailed and operationally explicit documentation, whereas public sponsors often articulate high-level commitments without trial-specific guidance. Greater clarity and standardisation, particularly among public sponsors, could improve transparency and facilitate responsible data reuse, while remaining compatible with GDPR requirements.

7

Demystifying Clone-Censor-Weight Method in Target Trial Emulation: A Real-World Study of HPV Vaccination Strategies

Lin, T.; Li, Y.; Huang, Z.; Gui, T. T.; Wang, W.; Guo, Y.

2026-04-22 health informatics 10.64898/2026.04.21.26351413 medRxiv

Top 0.2%

4.2%

Target trial emulation (TTE) offers a principled way to estimate treatment effects using real-world observational data, but analyses of time-varying treatment strategies remain vulnerable to immortal time bias. The clone-censor-weight (CCW) approach is increasingly used to address this problem, yet key aspects of its causal interpretation and implementation remain unclear. In this work, we emulate a target trial using electronic health records (EHRs) to compare completion of a 3-dose 9-valent human papillomavirus (HPV) vaccination series within 12 months versus remaining partially vaccinated among vaccine initiators. We link CCW to the classic potential outcome framework in causal inference, evaluate the role of different weighting mechanisms, and account for within-subject correlation induced by cloning using cluster-robust variance estimation. Our study provides practical guidance for applying CCW in real-world comparative effectiveness studies to address immortal time bias and supports more rigorous and interpretable treatment effect estimation in TTE.

8

Covariate adjustment for hierarchical outcomes and the win ratio: how to do it and is it worthwhile?

Hazewinkel, A.-D.; Gregson, J.; Bartlett, J. W.; Gasparyan, S. B.; Wright, D.; Pocock, S.

2026-03-31 cardiovascular medicine 10.64898/2026.03.30.26347966 medRxiv

Top 0.2%

4.1%

Objectives: To introduce a new covariate adjustment method for hierarchical outcomes using ordinal logistic regression, compare it with existing approaches, and assess whether adjustment improves power in randomized trials with hierarchical outcomes. Methods: We developed an ordinal regression-based method for covariate adjustment of the win ratio and compared it with three alternatives: probability index models, inverse probability weighting, and a randomization-based estimator. Methods were applied to the EMPEROR-Preserved trial and tested through extensive simulations involving two common hierarchical outcome structures: time-to-event composites, and composites combining time-to-event with quantitative measures. Simulations assessed impacts on estimates, standard errors, and power across prognostic and non-prognostic settings. Results: In RCT data and simulations, covariate adjustment consistently increased power when adjusting for prognostic baseline variables. Gains were comparable to or greater than those in conventional Cox models, with no power loss for non-prognostic covariates. Our ordinal approach performed similarly to existing methods while providing interpretable covariate effect estimates. Adjusting for baseline values of quantitative components yielded power gains according to the baseline-to-follow-up correlation. Conclusions: Covariate adjustment for prognostic variables meaningfully improves efficiency in win ratio analyses for hierarchical outcomes. Our ordinal method is easily implemented and facilitates covariate effect interpretation. We recommend the broader adoption of covariate adjustment and our ordinal method in randomized trials using hierarchical outcomes.
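For readers unfamiliar with the win ratio, the following minimal sketch shows the unadjusted statistic for a two-tier hierarchical outcome. It is not the authors' ordinal regression method and it ignores censoring entirely. The toy data, the tier ordering (survival time first, then a quality-of-life score), and the function names are all hypothetical.

```python
from itertools import product

def compare(a, b):
    """Return 1 if a wins, -1 if a loses, 0 if tied, checking tiers in priority order."""
    for x, y in zip(a, b):
        if x > y:
            return 1
        if x < y:
            return -1
    return 0  # tied on every tier

def win_ratio(treated, control):
    """Unadjusted win ratio: wins / losses over all treated-control pairs."""
    wins = losses = 0
    for t, c in product(treated, control):
        result = compare(t, c)
        if result == 1:
            wins += 1
        elif result == -1:
            losses += 1
    return wins / losses  # undefined if there are no losses

# Hypothetical toy data: (survival time in days, quality-of-life score); higher is better.
treated = [(400, 60), (365, 70), (200, 50)]
control = [(365, 65), (150, 55), (200, 50)]
print(win_ratio(treated, control))
```

In this toy data the nine pairs give 7 wins, 1 loss, and 1 tie, so the sketch prints 7.0. A real analysis would also need to handle censored follow-up, which this example leaves out.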

9

Predicting Depressive Symptoms Among Reproductive-Aged Women in Bangladesh Using Bagging Ensemble Machine Learning on Imbalanced Bangladesh Demographic and Health Survey 2022 Data

Mahmud, S.; Akter, M. S.; Ahamed, B.; Rahman, A. E.; El Arifeen, S.; Hossain, A. T.

2026-04-23 public and global health 10.64898/2026.04.22.26351445 medRxiv

Top 0.2%

3.9%

Background: Depressive symptoms among reproductive-aged women represent a major public health concern in low- and middle-income countries, yet systematic screening remains limited. In most population survey datasets, the low prevalence of depression results in severe class imbalance, which challenges conventional machine learning models. Therefore, we develop and evaluate a bagging-based ensemble machine learning framework to predict depressive symptoms among reproductive-aged women using highly imbalanced Bangladesh Demographic and Health Survey (BDHS) 2022 data. Methods: The sample comprised women aged 15-49 years drawn from BDHS 2022 data. Depressive symptoms were defined using the Patient Health Questionnaire (PHQ-9 ≥10). Candidate predictors were drawn from sociodemographic, reproductive, nutritional, psychosocial, healthcare access, and environmental domains. Feature selection was performed using Elastic Net (EN), Random Forest (RF), and XGBoost models. Five classifiers (EN, RF, Support Vector Machine (SVM), K-nearest neighbors (KNN), and Gradient Boosting Machine (GBM)) were trained using both oversampling-based approaches and the proposed ensemble framework. Model performance was evaluated on an independent test set using accuracy, sensitivity, specificity, F1-score, and the normalized Matthews correlation coefficient (normMCC). Results: Approximately 4.8% of women were identified with depressive symptoms. The proposed bagging ensemble framework consistently achieved more balanced predictive performance than oversampling-based models. Average normMCC improved from 0.540 (oversampling) to 0.557 (ensemble). RF and GBM ensembles demonstrated notable improvements in identifying depressive cases, while the EN ensemble achieved the highest overall performance and sensitivity. Threshold optimization yielded stable normMCC across models, indicating robust trade-offs between sensitivity and specificity. Conclusions: Bagging-based ensemble learning provides a more robust and balanced approach than synthetic oversampling for predicting depressive symptoms in highly imbalanced population survey data. This approach has important implications for improving early identification and population-level mental health surveillance in resource-constrained settings.
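The normMCC metric can be sketched from a confusion matrix as follows. This assumes the common rescaling normMCC = (MCC + 1) / 2, which maps the MCC range [-1, 1] onto [0, 1]; the abstract does not spell out the formula used. The confusion-matrix counts are hypothetical, chosen only to mimic roughly 4.8% prevalence in 1,000 women.

```python
from math import sqrt

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # conventional value when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom

def norm_mcc(tp, fp, fn, tn):
    """Normalized MCC, assuming the rescaling (MCC + 1) / 2."""
    return (mcc(tp, fp, fn, tn) + 1) / 2

# Hypothetical classifier on 1,000 women, 48 of them with depressive symptoms.
print(norm_mcc(tp=30, fp=150, fn=18, tn=802))

# A classifier that labels everyone negative: 95.2% accuracy, but no skill.
print(norm_mcc(tp=0, fp=0, fn=48, tn=952))
```

The second call returns 0.5, the chance-level value, even though that classifier is 95.2% accurate, which shows why accuracy alone is uninformative on data this imbalanced.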

10

Cochrane Evaluation of (Semi-) Automated Review (CESAR) Methods: Protocol for an adaptive platform study within reviews

Gartlehner, G.; Banda, S.; Callaghan, M.; Chase, J.-A.; Dobrescu, A.; Eisele-Metzger, A.; Flemyng, E.; Gardner, S.; Griebler, U.; Helfer, B.; Jemiolo, P.; Macura, B.; Minx, J. C.; Noel-Storr, A.; Rajabzadeh Tahmasebi, N.; Sharifan, A.; Meerpohl, J.; Thomas, J.

2026-04-15 health informatics 10.64898/2026.04.13.26350802 medRxiv

Top 0.3%

3.5%

Background: Artificial intelligence (AI) has the potential to improve the efficiency of evidence synthesis and reduce human error. However, robust methods for evaluating rapidly evolving AI tools within the practical workflows of evidence synthesis remain underdeveloped. This protocol describes a study design for assessing the effectiveness, efficiency, and usability of AI tools in comparison to traditional human-only workflows in the context of Cochrane systematic reviews. Methods: Members of the Cochrane Evaluation of (Semi-) Automated Review (CESAR) Methods Project developed an adaptive platform study-within-a-review (SWAR) design, modeled after clinical platform trials. This design employs a master protocol to concurrently evaluate multiple AI tools (interventions) against a standard human-only process (control) across three key review tasks: title and abstract screening, full-text screening, and data extraction. The adaptive framework allows for the addition or removal of AI tools based on interim performance analyses without necessitating a restart of the study. Performance will be assessed using metrics such as accuracy (sensitivity, specificity, precision), efficiency (time on task), response stability, impact of errors, and usability, in alignment with Responsible use of AI in evidence SynthEsis (RAISE) principles. Results: The study will generate comparative data about the performance and usability of specific AI tools employed in a semi- or fully automated manner relative to standard human effort. The protocol provides a flexible framework for the assessment of AI tools in evidence synthesis, addressing the limitations of static, one-time evaluations. Discussion: This study protocol presents a novel methodological approach to addressing the challenges of evaluating AI tools for evidence syntheses. 
By validating entire workflows rather than individual technologies, the findings will establish an evidence base for determining the viability of integrating AI into evidence-synthesis workflows. The adaptive design of this study is flexible and can be adopted by other investigators, ensuring that the evaluation framework remains relevant as new tools emerge.

11

A bibliometric review of explainable AI in diabetes risk prediction: Trends, gaps, and knowledge graph opportunities

Van, T. A.

2026-04-20 health informatics 10.64898/2026.04.16.26351069 medRxiv

Top 0.3%

3.5%

Background: Type 2 diabetes mellitus (T2DM) is a leading global public health challenge. Machine learning (ML) combined with Explainable AI (XAI) is increasingly applied to T2DM risk prediction, but the field lacks a quantitative overview of methodological trends and integration gaps. Methods: We present a structured synthesis and critical analysis of the XAI literature on T2DM risk prediction, combining (i) quantitative bibliometric analysis of a two-database corpus (N = 2,048 documents from Scopus and PubMed/MEDLINE, deduplicated via a transparent three-tier pipeline) and (ii) an in-depth selective review of 15 highly cited papers. Reporting follows PRISMA 2020, adapted for metadata-based synthesis; analyses include keyword frequency, rule-based thematic clustering, and publication trend analysis. Results: The field grew rapidly, from 36 documents (2020) to 866 (2025). SHAP and LIME dominate XAI methods; XGBoost and Random Forest dominate ML models. Critically, KG/GNN terms appeared in only 17 documents (~0.83%) compared with 906 for XAI methods, a 53.3:1 disparity. This gap is consistent across both databases, which share 33.2% of their records, ruling out a single-database artifact. The selective review confirmed that none of the 15 highly cited papers combined all three components, ML, XAI, and KG, in T2DM risk prediction. Conclusions: The XAI for T2DM risk prediction field exhibits a clinical interpretability gap: statistical explanations are rarely linked to structured clinical pathways. We propose a three-layer conceptual framework (Predictive → Explainability → Knowledge) that integrates KG as a supplementary semantic layer, with potential applications in clinical decision support and population-level screening. The framework does not perform true causal inference but structures explanations around established pathophysiological knowledge. This study contributes a transferable methodology and a quantified research gap to guide future work integrating ML, XAI, and structured medical knowledge.
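The headline proportions in this abstract follow directly from the reported counts, as the short check below shows. All inputs are taken from the abstract; the growth factor is a derived figure that the abstract does not itself report.

```python
# Counts reported in the abstract.
total_docs = 2048      # deduplicated corpus size
kg_docs = 17           # documents mentioning KG/GNN terms
xai_docs = 906         # documents mentioning XAI methods
docs_2020, docs_2025 = 36, 866

kg_share_pct = round(100 * kg_docs / total_docs, 2)   # share of corpus with KG/GNN terms
xai_to_kg_ratio = round(xai_docs / kg_docs, 1)        # XAI-to-KG disparity
growth_factor = round(docs_2025 / docs_2020, 1)       # derived, not stated in the abstract

print(kg_share_pct, xai_to_kg_ratio, growth_factor)
```

The first two values reproduce the reported ~0.83% and 53.3:1.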

12

Benchmarking Heritability Estimation Strategies Across 86 Configurations and Their Downstream Effect on Polygenic Risk Score Performance

Muneeb, M.; Ascher, D.

2026-04-02 bioinformatics 10.64898/2026.04.02.716079 medRxiv

Top 0.3%

3.2%

Objective: SNP heritability estimates vary substantially across estimation strategies, yet the downstream consequences for polygenic risk score (PRS) construction remain poorly characterised. We systematically benchmarked heritability estimation configurations and assessed their propagation into downstream PRS performance. Methods: We benchmarked 86 heritability-estimation configurations spanning six tool families (GEMMA, GCTA, LDAK, DPR, LDSC, SumHer) and ten method groups across 10 UK Biobank phenotypes, yielding 844 configuration-level estimates. Each estimate was propagated into GCTA-SBLUP and LDpred2-lassosum2 PRS frameworks and evaluated across five cross-validation folds using null, PRS-only, and full models. Eleven binary analytical contrasts were tested using Mann-Whitney U tests to identify drivers of heritability variability. Results: Heritability ranged from -0.862 to 2.735 (mean = 0.134, SD = 0.284), with 133 of 844 estimates (15.8%) negative and concentrated in unconstrained estimation regimes. Ten of eleven analytical contrasts significantly affected heritability magnitude, with algorithm choice and GRM standardisation showing the largest effects. Despite this upstream variability, downstream PRS test performance was only weakly coupled to heritability magnitude: pooled Pearson correlations between h2 and test AUC were r = -0.023 for GCTA-SBLUP and r = +0.014 for LDpred2-lassosum2 (both non-significant). Conclusion: SNP heritability is best interpreted as a configuration-sensitive modelling parameter rather than a universally stable scalar input. Heritability estimates should always be reported alongside their full estimation specification, and downstream PRS performance is comparatively robust to moderate variation in the heritability input.

13

Strategic Point Coverage for Scorpion Accident Care: Methodological Considerations and Application in Sao Paulo State, Brazil

Pereira dos Santos, G.; Gonzalez-Araya, M. C.; Gomez-Lagos, J. E.; Dias de Freitas, G.; de Oliveira, A.; de Azevedo, T. S.; Santos Dourado, F.; Lacerda, A. B.; de Jesus Leal, E.; Candido, D. M.; Hui Wen, F.; Lorenz, C.; Chiaravalloti Neto, F.

2026-03-31 epidemiology 10.64898/2026.03.30.26349723 medRxiv

Top 0.4%

2.7%

Scorpionism is a public health concern in warm regions, particularly affecting children under 10 years old. Timely treatment with antivenom, provided free by the Brazilian Unified Health System at strategic care points (PEs), is crucial to prevent avoidable deaths. Our study focused on Sao Paulo state (SP), which has the largest population in Brazil. The objectives were to adapt a network analysis method to SP's context; to assess the efficiency of the SP PE network coverage, considering the 90-minute response time; and to determine the ideal number of vials to be stored at each PE. After adapting the healthcare network analysis, we applied spatial coverage models to evaluate the adequacy of PE response times. We also estimated the demand for antivenom vials at each PE based on Notifiable Diseases Information System data from 2021 to 2023, which is currently limited to the state level. We identified 12 areas lacking coverage, of which only one was suitable for a new PE. The estimated serum requirements aligned with SP's current distributions. However, estimating requirements at the level of individual PEs has the advantage of reducing the risk of antivenom shortages, especially in emergencies, thus ensuring timely care to prevent avoidable deaths. Our adapted method and PE serum estimates can enhance the scorpion sting care system by supporting geographic planning and optimizing resource allocation. Moreover, these findings and methodologies have potential applicability to other Brazilian regions and warm countries facing similar challenges, contributing to improved access and outcomes for scorpionism victims.

14

Data Resource Profile: EST-Health-30

Reisberg, S.; Oja, M.; Mooses, K.; Tamm, S.; Sild, A.; Talvik, H.-A.; Laur, S.; Kolde, R.; Vilo, J.

2026-04-24 epidemiology 10.64898/2026.04.21.26351087 medRxiv

Top 0.4%

2.7%

Background: The increasing availability of routinely collected health data offers new opportunities for population-level research, yet access to comprehensive, linked, and standardised datasets remains limited. We describe EST-Health-30, a large-scale, population-representative health data resource from Estonia. Methods: EST-Health-30 comprises a random 30% sample of the Estonian population (~500,000 individuals), with longitudinal data from 2012 to 2024 and annual updates planned through 2026. Individual-level records are linked across five nationwide databases, including electronic health records, health insurance claims, prescription data, cancer registry, and cause of death records. A privacy-preserving hashing approach ensures consistent cohort inclusion over time while maintaining pseudonymisation. All data are harmonised to the Observational Medical Outcomes Partnership (OMOP) Common Data Model (version 5.4) using international standard vocabularies. Data quality was assessed using established OMOP-based validation frameworks. Results: The dataset contains rich multimodal information on diagnoses, procedures, laboratory measurements, prescriptions, free-text clinical notes, healthcare utilisation, and costs, with high population coverage and longitudinal depth. Data quality assessment showed high completeness and consistency, with 99.2% of applicable checks passing. The age-sex distribution closely reflects the national population, supporting representativeness, though coverage is marginally below the target 30% (29.2%), primarily attributable to recent immigrants without health system contact. The dataset enables construction of detailed clinical cohorts, analysis of disease trajectories, and evaluation of healthcare utilisation and outcomes across the life course. Conclusions: EST-Health-30 is a comprehensive, standardised, and population-representative real-world data resource that supports epidemiological, clinical, and methodological research. 
Its alignment with the OMOP CDM facilitates reproducible analytics and participation in international federated research networks, while secure access infrastructure ensures compliance with data protection regulations.

15

Mediation analysis in longitudinal data: an unbiased estimator for cumulative indirect effect

Li, Y.; Cabral, H.; Tripodis, Y.; Ma, J.; Levy, D.; Joehanes, R.; Liu, C.; Lee, J.

2026-04-20 epidemiology 10.64898/2026.04.18.26351189 medRxiv

Top 0.4%

2.6%

Mediation analysis quantifies how an exposure affects an outcome through an intermediate variable. We extend mediation analysis to capture the cumulative effects of longitudinal predictors on longitudinal outcomes. Our proposed model examines how mediators transmit the effects of the current and previous exposure on the current outcome. We construct a least-squares estimator for the cumulative indirect effect (CIE) and use three approaches (exact form, delta method, and bootstrap procedure) to estimate its standard error (SE). The estimator of the CIE is unbiased under no unmeasured confounding and independent errors between the mediator and outcome models at all time points, as shown both theoretically and in simulations. While the three SE estimates are numerically similar, the bootstrap procedure is recommended because it is simple to implement. We apply this method to the Framingham Heart Study Offspring cohort to assess whether DNA methylation mediates the association of alcohol consumption with systolic blood pressure over two time points. We identify two CpGs (cg05130679 and cg05465916) as mediators and construct a composite DNA methylation score from 11 CpGs, which mediates 39% of the cumulative effect. In conclusion, we propose an unbiased estimator for the CIE. Future studies will investigate missingness in mediators and outcomes.

16

A Reproducible Health Informatics Pipeline for Simulating and Integrating Early-Phase Oncology Clinical, Biomarker, and Pharmacokinetic Data for Exploratory Decision-Support Analytics

Petalcorin, M. I. R.

2026-04-02 health informatics 10.64898/2026.03.27.26349538 medRxiv

Top 0.4%

2.5%

Background: Early-phase oncology development increasingly depends on integrated interpretation of clinical outcomes, translational biomarkers, and pharmacokinetic exposure rather than toxicity alone. This shift has created a need for reproducible analytical workflows that can combine heterogeneous trial data into traceable, analysis-ready outputs suitable for exploratory review and early decision support. Objective: To develop a reproducible Python-based workflow that simulates a plausible early-phase oncology study, integrates clinical, biomarker, and pharmacokinetic data, and generates analysis-ready datasets, visual summaries, and exploratory predictive models relevant to early development analytics. Methods: A workflow was constructed to simulate an early-phase oncology cohort of 120 patients distributed across multiple dose levels. Three synthetic raw data sources were generated, including patient-level clinical data, baseline biomarker data, and longitudinal pharmacokinetic profiles. These sources were merged into a single analysis-ready dataset containing derived variables such as tumor percent change from baseline, clinical-benefit status, exposure summaries, adverse-event indicators, and survival outcomes. The workflow produced structured tables, patient listings, waterfall plots, Kaplan-Meier-style survival curves, biomarker-response visualizations, pharmacokinetic profile plots, and exploratory machine-learning outputs. Results: The final integrated dataset contained 120 patients and 30 variables. Median survival across the simulated cohort was 243.8 days, and higher dose groups showed improved median survival and greater clinical benefit relative to the low-dose group. Clinical benefit increased from 8.6% in the low-dose group to 29.0% in the medium-dose group and 45.2% in the high-dose group. 
Higher baseline LDH, CRP, and ctDNA fraction tracked with less favorable tumor-response trajectories, whereas higher exposure, reflected by AUC and Cmax, associated with improved disease control. Pharmacokinetic profiles showed clear dose-dependent separation. Grade 3 or higher adverse-event rates remained within a plausible exploratory range across dose groups. A random-forest model for clinical benefit achieved an exploratory ROC AUC of 0.845, while a logistic-regression model for strict responder status could not be fit because no simulated patient met the prespecified objective response threshold. Conclusions: This proof-of-concept demonstrates that a transparent Python workflow can generate a coherent early-phase oncology analytical ecosystem from synthetic inputs. The workflow supports integration of heterogeneous data streams, derivation of analysis-ready variables, production of interpretable outputs, and exploratory modeling in a reproducible framework. Although the simulated responder prevalence was too low to support objective response modeling, this limitation itself highlights the importance of simulation calibration for downstream analytical validity. The framework provides a practical Health Informatics demonstration of how early oncology trial data can be structured and analyzed for exploratory translational decision support.
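The integration step described in this abstract can be sketched in a few lines. The example below merges three per-patient sources and derives tumor percent change from baseline, AUC (by the trapezoidal rule), and Cmax. The field names, the toy records, and the dictionary-based layout are assumptions for illustration and are not taken from the preprint's workflow.

```python
def trapezoid_auc(times, concs):
    """Area under the concentration-time curve by the linear trapezoidal rule."""
    points = sorted(zip(times, concs))
    return sum(
        (t1 - t0) * (c0 + c1) / 2
        for (t0, c0), (t1, c1) in zip(points, points[1:])
    )


def percent_change(baseline, follow_up):
    """Percent change from baseline; negative values mean shrinkage."""
    return (follow_up - baseline) / baseline * 100.0


def build_analysis_rows(clinical, biomarkers, pk):
    """Merge the three sources on patient ID and add derived variables."""
    rows = []
    for pid, clin in clinical.items():
        times, concs = zip(*pk[pid])
        rows.append({
            "patient_id": pid,
            "dose": clin["dose"],
            "tumor_pct_change": percent_change(clin["baseline_mm"], clin["best_mm"]),
            "ldh": biomarkers[pid]["ldh"],
            "auc": trapezoid_auc(times, concs),
            "cmax": max(concs),
        })
    # Waterfall ordering: largest increase first, largest shrinkage last.
    return sorted(rows, key=lambda r: r["tumor_pct_change"], reverse=True)


if __name__ == "__main__":
    # Hypothetical toy records, two patients only.
    clinical = {
        "P1": {"dose": "high", "baseline_mm": 50.0, "best_mm": 40.0},
        "P2": {"dose": "low", "baseline_mm": 40.0, "best_mm": 46.0},
    }
    biomarkers = {"P1": {"ldh": 210.0}, "P2": {"ldh": 380.0}}
    pk = {
        "P1": [(0, 0.0), (1, 10.0), (2, 6.0), (4, 2.0)],
        "P2": [(0, 0.0), (1, 4.0), (2, 3.0), (4, 1.0)],
    }
    for row in build_analysis_rows(clinical, biomarkers, pk):
        print(row)
```

Clinical-benefit status is left out on purpose, since the abstract does not give the threshold used to define it.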

17

Longitudinal information extraction from clinical notes in rare diseases: an efficient approach with small language models

Wang, X.; Faviez, C.; Vincent, M.; Andrew, J. J.; Le Priol, E.; Saunier, S.; Knebelmann, B.; Zhang, R.; Garcelon, N.; Burgun, A.; Chen, X.

2026-03-31 health informatics 10.64898/2026.03.30.26349388 medRxiv

Top 0.4%

2.1%

Objectives: Rare diseases often require longitudinal monitoring to characterise progression, yet much clinical information remains locked in unstructured electronic health records (EHRs). Efficient recovery of such data is critical for accurate prognostic modelling and clinical trial preparation. We aimed to develop and evaluate a small language model (SLM)-based pipeline for extracting longitudinal information from French clinical notes of patients with rare kidney diseases. Methods: As a use case, we focused on serum creatinine, a key biomarker of kidney function. We analyzed 81 clinical notes comprising 200 measurements (triplet of date, value and unit). Four open-source SLMs (Mistral-7B, Llama-3.2-3B, Qwen3-4B, Qwen3-8B) were systematically tested with different prompting strategies in French and English. Outputs were post-processed to standardize formats and resolve inconsistencies, and performance was assessed across model size, prompting, language, and robustness to text duplication. Results: All SLMs extracted structured triplets, with F1-scores ranging from 0.519 to 0.928 (Qwen3-8B), outperforming the rule-based baseline. Larger models generally performed better, while prompting strategy and language had modest effects across models. SLMs also showed variable robustness to duplicated content common in real-world EHR notes. Discussion: Lightweight, locally deployable language models can accurately extract longitudinal biomarkers from unstructured clinical notes. Our findings highlight their practicality for rare diseases where data scarcity often limits task-specific model training. Conclusion: SLMs provide a privacy-preserving and resource-efficient solution for recovering longitudinal biomarker trajectories from unstructured notes, offering potential to advance real-world research and patient care in rare kidney diseases.
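The post-processing and scoring steps mentioned in this abstract might look roughly like the sketch below. It assumes French-style dates (dd/mm/yyyy), decimal commas, and two creatinine units, and it scores exact matches on normalized triplets. The input formats, function names, and matching rule are assumptions, since the abstract does not describe the authors' post-processing in detail. The conversion factor of 88.4 between mg/dL and µmol/L is the standard one for creatinine.

```python
from datetime import datetime

MGDL_TO_UMOLL = 88.4  # standard creatinine conversion: 1 mg/dL = 88.4 umol/L


def normalize(date_str, value_str, unit_str):
    """Return a (ISO date, value in umol/L, unit) triplet from raw model output."""
    date = datetime.strptime(date_str.strip(), "%d/%m/%Y").date().isoformat()
    value = float(value_str.strip().replace(",", "."))  # French decimal comma
    # Fold both Unicode forms of the micro sign to a plain "u".
    unit = unit_str.strip().lower().replace("\u00b5", "u").replace("\u03bc", "u")
    if unit == "mg/dl":
        value *= MGDL_TO_UMOLL
    elif unit != "umol/l":
        raise ValueError(f"unsupported unit: {unit_str!r}")
    return (date, round(value, 1), "umol/L")


def f1_score(predicted, gold):
    """Exact-match F1 over sets of normalized triplets."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = [normalize("03/02/2021", "106,1", "\u00b5mol/L"),
            normalize("10/05/2021", "120", "\u00b5mol/L")]
    predicted = [normalize("03/02/2021", "1,2", "mg/dL"),   # same measurement, other unit
                 normalize("11/05/2021", "120", "umol/l")]  # wrong date
    print("F1:", f1_score(predicted, gold))
```

Normalizing before scoring means that a correct extraction reported in a different unit or number format still counts as a match.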

18

An Empirical Assessment of Inferential Reproducibility of Linear Regression in Health and Biomedical Research Papers

Jones, L.; Barnett, A.; Hartel, G.; Vagenas, D.

2026-04-07 health systems and quality improvement 10.64898/2026.04.07.26350296 medRxiv

Top 0.4%

2.1%

Background: In health research, variability in modelling decisions can lead to different conclusions even when the same data are analysed, a challenge known as inferential reproducibility. In linear regression analyses, incorrect handling of key assumptions, such as normality of the residuals and linearity, can undermine reproducibility. This study examines how violations of these assumptions influence inferential conclusions when the same data are reanalysed. Methods: We randomly sampled 95 health-related PLOS ONE papers from 2019 that reported linear regression in their methods. Data were available for 43 papers, and 20 were assessed for computational reproducibility, with three models per paper evaluated. The 14 papers that included a model at least partially computationally reproduced were then examined for inferential reproducibility. To assess the impact of assumption violations, differences in coefficients, 95% confidence intervals, and model fit were compared. Results: Of the fourteen papers assessed, only three were inferentially reproducible. The most frequently violated assumptions were normality and independence, each occurring in eight papers. Violations of independence were particularly consequential and were commonly associated with inferential failure. Although reproduced analyses often retained the same binary statistical significance classification as the original studies, confidence intervals were frequently wider, indicating greater uncertainty and reduced precision. Such uncertainty may affect the interpretation of results and, in turn, influence treatment decisions and clinical practice. Conclusion: Our findings demonstrate that substantial violations of key modelling assumptions often went undetected by authors and peer reviewers and, in many cases, were associated with inferential reproducibility failure. This highlights the need for stronger statistical education and greater transparency in modelling decisions. 
Rather than applying rigid or misinformed rules, such as incorrectly testing the normality of the outcome variable, researchers should adopt modelling frameworks guided by the research question and the study design. When assumptions are violated, appropriate alternatives, such as robust methods, bootstrapping, generalized linear models, or mixed-effects models, should be considered. Given that assumption violations were common even in relatively simple regression models, early and sustained collaboration with statisticians is critical for supporting robust, defensible, and clinically meaningful conclusions.
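The point about testing the outcome rather than the residuals can be illustrated with a small simulation. In the sketch below, the outcome is strongly skewed only because the predictor is skewed, while the regression residuals are symmetric, so a normality check on the outcome would wrongly flag a valid model. The data-generating values and function names are assumptions chosen for illustration, and sample skewness stands in for a formal normality test.

```python
import random
import statistics


def skewness(values):
    """Sample skewness (third standardized moment)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def ols_residuals(x, y):
    """Residuals from a simple linear regression of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sxy / sxx
    b0 = my - b1 * mx
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


def simulate(n=2000, seed=42):
    """Skewed predictor, normal errors: the linear model is correctly specified."""
    rng = random.Random(seed)
    x = [rng.expovariate(1.0) for _ in range(n)]
    y = [1.0 + 3.0 * xi + rng.gauss(0, 1) for xi in x]
    return x, y


if __name__ == "__main__":
    x, y = simulate()
    print("skewness of outcome:  ", round(skewness(y), 2))
    print("skewness of residuals:", round(skewness(ols_residuals(x, y)), 2))
```

The normality assumption in linear regression concerns the errors, which is why diagnostics belong on the residuals after the model is fitted.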

19

Electronic Health Record-Based Estimation of Kansas City Cardiomyopathy Questionnaire Scores in Heart Failure

Kim, Y. W.; Lau, W.; Patel, N.; Kendrick, K.; Wu, A.; Feldman, T.; Ahern, R.; Oka, A.

2026-04-05 health informatics 10.64898/2026.04.03.26350138 medRxiv

Top 0.4%

2.1%

Background: The Kansas City Cardiomyopathy Questionnaire (KCCQ) is a validated patient-reported outcome measure for heart failure. However, its clinical utility is limited by incomplete and inconsistent data collection. We aimed to develop and validate machine learning models to estimate KCCQ overall summary scores from electronic health record (EHR) data. Methods: We assembled a retrospective cohort of 10,889 heart failure patients with recorded KCCQ scores from the Truveta database. Predictor features were derived from structured EHR variables across 13 historical time windows (15-360 days). Multiple regression algorithms were evaluated, followed by SHapley Additive exPlanations (SHAP)-based feature reduction and nested cross-validation for hyperparameter optimization. Model performance was assessed using the coefficient of determination (R²), mean absolute error (MAE), and ordinal discrimination and calibration for categorical severity classification. Results: Histogram-based gradient boosting (HGB) with HGB-SHAP feature selection achieved the strongest performance, reducing feature dimensionality by more than 94% while maintaining estimation accuracy. The 240-day window performed best (R²=0.522, MAE=12.485). For categorical severity classification, the model demonstrated strong ordinal discrimination (mean ordinal AUROC=0.850). Quantile-based calibration improved classification balance, increasing the F1-score for the most severe category (KCCQ<25) from 0.180 to 0.428 and the quadratic weighted kappa from 0.601 to 0.640. Longer EHR observation windows were associated with improved prediction performance. Conclusion: Machine learning models can estimate KCCQ scores from routine EHR data with clinically meaningful accuracy and strong discriminatory performance. 
This approach may help extend assessment of patient-reported health status to populations in which survey-based data are incompletely captured, supporting population-level cardiovascular outcomes assessment and risk stratification in heart failure care.
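For readers unfamiliar with the agreement metrics reported here, the following sketch computes mean absolute error on continuous scores and the quadratic weighted kappa on severity categories. The four 25-point severity bins are an assumption: the abstract only names the most severe category (KCCQ<25). Function names are hypothetical.

```python
def mae(true_scores, predicted_scores):
    """Mean absolute error between observed and estimated scores."""
    return sum(abs(t - p) for t, p in zip(true_scores, predicted_scores)) / len(true_scores)


def severity_class(score):
    """Map a 0-100 score to classes 0..3, with 0 the most severe (assumed 25-point bins)."""
    return min(int(score // 25), 3)


def quadratic_weighted_kappa(true, pred, n_classes):
    """Cohen's kappa with quadratic disagreement weights.

    Undefined (division by zero) when every label falls in a single class.
    """
    n = len(true)
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true, pred):
        observed[t][p] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under independence of the two ratings.
            weighted_expected += weight * row[i] * col[j] / n
    return 1.0 - weighted_observed / weighted_expected


if __name__ == "__main__":
    observed_scores = [12.0, 30.0, 55.0, 80.0, 95.0, 20.0]   # hypothetical values
    estimated_scores = [28.0, 35.0, 49.0, 70.0, 90.0, 15.0]  # hypothetical values
    true_cls = [severity_class(s) for s in observed_scores]
    pred_cls = [severity_class(s) for s in estimated_scores]
    print("MAE:", round(mae(observed_scores, estimated_scores), 2))
    print("QWK:", round(quadratic_weighted_kappa(true_cls, pred_cls, 4), 3))
```

Quadratic weights penalize a prediction two categories off four times as heavily as one that is a single category off, which suits an ordinal severity scale.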

20

Episia: An Open-Source Python Library for Epidemiological Surveillance, Modeling, and Biostatistics in Resource-Limited Settings

Ouedraogo, F. A. S.

2026-04-20 epidemiology 10.64898/2026.04.17.26350337 medRxiv

Top 0.4%

2.1%

Despite advances in epidemiological analysis and modeling tools, their availability and use remain difficult, especially in developing countries. These tools are often expensive, require high technical expertise, and demand constant connectivity and sometimes significant resources; although efficient, they are poorly matched to the operational realities of health districts. In this context, we introduce Episia, an open-source Python library that provides a framework for epidemiological analysis and modeling. It integrates a suite of compartmental epidemic models (SIR, SEIR, SEIRD) with Monte Carlo sensitivity analysis, a complete biostatistics suite validated against the OpenEpi reference standard, and a native DHIS2 client for automated data ingestion. Developed in Burkina Faso, Episia is designed to address health challenges encountered in Africa while remaining a versatile tool for global health informatics.
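As an illustration of what a compartmental model with Monte Carlo sensitivity analysis involves, here is a minimal standard-library sketch of an SIR model integrated with Euler steps, with the transmission rate drawn from a uniform range. This is not Episia's API: the function names, parameter ranges, and integration scheme are all assumptions for illustration.

```python
import random


def simulate_sir(beta, gamma, s0, i0, r0, days, dt=0.1):
    """Integrate the SIR equations with fixed Euler steps.

    Returns the final (S, I, R) and the peak number of infectious people.
    """
    s, i, r = float(s0), float(i0), float(r0)
    n = s + i + r
    peak = i
    for _ in range(int(days / dt)):
        new_infections = beta * s * i / n * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        peak = max(peak, i)
    return s, i, r, peak


def monte_carlo_peaks(n_runs, beta_range, gamma, s0, i0, days, seed=0):
    """Sample beta uniformly from beta_range and collect the epidemic peak per run."""
    rng = random.Random(seed)
    peaks = []
    for _ in range(n_runs):
        beta = rng.uniform(*beta_range)
        peaks.append(simulate_sir(beta, gamma, s0, i0, 0, days)[3])
    return sorted(peaks)


if __name__ == "__main__":
    # Hypothetical district of 10,000 people with 10 initial cases.
    peaks = monte_carlo_peaks(200, (0.2, 0.4), 0.1, 9990, 10, 160)
    low, median, high = peaks[5], peaks[100], peaks[194]
    print(f"peak infectious: median {median:.0f} (approx. 95% range {low:.0f} to {high:.0f})")
```

Because it needs no network access and no third-party packages, a sketch like this runs offline, which is the kind of constraint the abstract describes for health districts.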